Performance Analysis of Compiler-Parallelized Programs on Shared-Memory Multiprocessors
Authors
Abstract
Shared-memory multiprocessor (SMP) machines have become widely available. As the user community grows, so does the importance of compilers that can translate standard, sequential programs onto this machine class. Substantial research has been done to develop sophisticated parallelization techniques, which can detect and exploit parallelism in many real applications. However, the performance of compiler-parallelized applications can be below expectations: the speedups of even fully parallel codes on today's shared-memory multiprocessors can be significantly less than the number of processors. In this paper we investigate the reasons for this behavior, focusing on three specific issues: (1) whether it is appropriate for the preprocessor to express the detected parallelism in the common loop-oriented form, (2) the sources of inefficiency in fully parallel SMP programs that exhibit good cache locality, and (3) the portability of these programs across SMP machines. In our experiments we have extended the Polaris compiler so that it can generate thread-based code directly. We compare the performance of this code with Polaris' loop-parallel OpenMP output form and with the architecture-specific directive languages available on the Sun Enterprise and SGI Origin systems. We have analyzed in detail the performance of several parallel Perfect Benchmarks. Our main findings are as follows. (1) Overall, there is no significant performance disadvantage to the loop-parallel representation. (2) However, substantial performance differences are attributable to instruction efficiency, which is influenced by the data-sharing semantics of parallel constructs. (3) Both the OpenMP and the thread-based program forms are functionally portable, but can yield substantially different performance on the two machines.
† This work was supported in part by DARPA contract #DABT63-95-C-0097 and NSF grants #9703180-CCR and #9872516-EIA. This work is not necessarily representative of the positions or policies of the U.S. Government.
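To make the loop-oriented representation concrete, the sketch below shows a minimal OpenMP parallel loop of the kind a parallelizer such as Polaris might emit. It is an illustration under assumed names (a, b, sum), not code from the paper; the data-sharing clauses on the directive are the construct semantics that, per finding (2), can influence instruction efficiency.

    #include <stdio.h>

    #define N 1000000

    double a[N], b[N];

    int main(void) {
        double sum = 0.0;

        /* Loop-parallel (OpenMP) form: the parallelism is attached to
           the loop itself, and the data-sharing semantics of every
           variable are declared on the directive. The loop index is
           private to each thread, a and b are shared, and sum is a
           reduction variable combined across threads at loop exit. */
        #pragma omp parallel for shared(a, b) reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = 2.0 * b[i];
            sum += a[i];
        }

        printf("sum = %f\n", sum);
        return 0;
    }

In the thread-based form that the extended Polaris generates directly, the loop body would instead be outlined into a routine handed to a thread library, with the iteration bounds partitioned explicitly among the threads rather than declared through directive clauses.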
Related Papers
Performance Modeling and Measurement of Parallelized Code for Distributed Shared Memory Multiprocessors
This paper presents a model to evaluate the performance and overhead of parallelizing sequential code using compiler directives for multiprocessing on distributed shared memory (DSM) systems. With increasing popularity of shared address space architectures, it is essential to understand their performance impact on programs that benefit from shared memory multiprocessing. We present a simple mod...
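The entry's description of its model is cut off above. As a generic illustration only, a simple overhead-aware speedup estimate for a directive-parallelized program might take the form below; the symbols are assumptions for this sketch, not the paper's notation:

    S(p) = \frac{T_s}{(1 - f)\,T_s + \frac{f\,T_s}{p} + T_o(p)}

Here T_s is the sequential running time, f the fraction of it that is parallelized, p the number of processors, and T_o(p) the aggregate directive overhead (fork/join and barrier costs) incurred by the parallel regions.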
Architectural and Software Support for Executing Numerical Applications on High Performance Computers
Numerical applications require large amounts of computing power. Although shared memory multiprocessors provide a cost-effective platform for parallel execution of numerical programs, parallel processing has not delivered the expected performance on these machines. There are two crucial steps in parallel execution of numerical applications: (1) effective parallelization of an application and (2) ...
Parallelization of NAS Benchmarks for Shared Memory Multiprocessors
This paper presents our experiences of parallelizing the sequential implementation of NAS benchmarks using compiler directives on SGI Origin2000 distributed shared memory (DSM) system. Porting existing applications to new high performance parallel and distributed computing platforms is a challenging task. Ideally, a user develops a sequential version of the application, leaving the task of port...
A Compiler Optimization Algorithm for Shared-Memory Multiprocessors
This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We ...
Publication date: 2000